Data Containers

A place for everything

Rodney J. Dyer

Vectors

A Vector

A vector is a storage container for data of a uniform data class type.

x <- c( 1, 2, 3, 4 )

All programmers are lazy and know that the fewer key presses you have to make, the less likely an error will be introduced. So the function named combine is represented as c().

x
[1] 1 2 3 4

Its job is to smush a bunch of data into a simple container.

Accessing Elements

A vector contains similar data types, and each element can be accessed using a numerical index placed within square brackets [ and ].

x[2]
[1] 2
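
Square brackets also accept more than one index, and a negative index drops an element instead of selecting it. A minimal sketch, using a vector with the same values as x (the name v is made up for this example):

```r
# Same values as x above; 'v' is a hypothetical name used only in this sketch
v <- c( 1, 2, 3, 4 )

v[ c(1, 3) ]   # first and third elements
v[ -1 ]        # everything except the first element
v[ 2:4 ]       # a range of positions (R counts from 1, not 0)
```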

Introspection

Because x is a vector AND it contains numeric data, the introspection operators for both vector and numeric will return TRUE.

 

is.vector( x )
[1] TRUE
is.numeric( x )
[1] TRUE

 

The object x is both a vector AND of numeric type.

Other Data Types

As long as the base data type is the exact same, vectors will always work properly.

c("A", "B", "C", "The Cat jumped over the moon")
[1] "A"                            "B"                           
[3] "C"                            "The Cat jumped over the moon"
c( TRUE, FALSE, FALSE, TRUE)
[1]  TRUE FALSE FALSE  TRUE

No Mixing Allowed

You CANNOT mix data types in a single vector and keep the same kinds of data. R will coerce to a least common data type so that they are all of the same type.

c( 1, TRUE, FALSE, 23)
[1]  1  1  0 23

 

c( 1, TRUE, "FALSE", 23)
[1] "1"     "TRUE"  "FALSE" "23"   
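
One way to see what the coercion did is to ask for the class of each result. This is only a quick check of the two examples above, plus an unmixed vector for comparison:

```r
# class() reports the single type that every element was coerced to
class( c( 1, TRUE, FALSE, 23) )       # logical values were turned into numbers
class( c( 1, TRUE, "FALSE", 23) )     # anything mixed with text becomes text
class( c( TRUE, FALSE ) )             # nothing to coerce here
```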

Sequences

Sometimes it is helpful to make a sequence of values in a vector. R has some built-in functionality for that.

 

Sequence Operator

w <- 1:6
w
[1] 1 2 3 4 5 6

The seq() function

x <- seq(10,30, by=3)
x
[1] 10 13 16 19 22 25 28

 

LETTERS

y <- LETTERS[1:5]
y
[1] "A" "B" "C" "D" "E"

The seq() function (again)

z <- seq(10,30, length.out = 6)
z
[1] 10 14 18 22 26 30
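
With length.out, seq() works out the step size for you. A small sketch of the arithmetic behind the example above (the variable name step is made up here):

```r
# The step is (to - from) / (length.out - 1)
step <- (30 - 10) / (6 - 1)
step

# The same sequence written with an explicit step
all.equal( seq(10, 30, length.out = 6), seq(10, 30, by = step) )
```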

Vector Operators

Data within vectors can be subjected to unary operators.

 

-z
[1] -10 -14 -18 -22 -26 -30

 

!z
[1] FALSE FALSE FALSE FALSE FALSE FALSE
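
The result of !z is all FALSE because the negation operator first treats each number as a logical value: zero counts as FALSE and any other number counts as TRUE. A short illustration with made-up values that include a zero:

```r
as.logical( c(0, 1, 10) )   # zero is FALSE, any other number is TRUE
!c(0, 1, 10)                # so negation is TRUE only for the zero
```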

Vector Operators

As well as binary operators.

 

w + z 
[1] 11 16 21 26 31 36
z^w
[1]        10       196      5832    234256  11881376 729000000

Recycling Rule

If you apply a binary operator to two vectors whose lengths are different, R will recycle the values in the shorter one.

c(1,2,3) + c(10,20,30,40,50,60)
[1] 11 22 33 41 52 63

 

But if the lengths are not clean multiples, R will give you a warning (but still give you an answer).

c(1,2,3,4) + c(10,20,30,40,50,60)
[1] 11 22 33 44 51 62
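
The warning itself is not shown in the output above. One possible way to look at it is to catch it with tryCatch() and keep its message as text (the name msg is made up for this sketch):

```r
# Catch the warning raised when the longer length is not a
# multiple of the shorter one, and keep its message as text
msg <- tryCatch(
  c(1,2,3,4) + c(10,20,30,40,50,60),
  warning = function(w) conditionMessage(w)
)
msg

# Clean multiples recycle silently
c(1,2) + c(10,20,30,40)
```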

In-Class Activity

For a random and entirely made-up example of a homework assignment, the participants' raw scores were recorded as: 32, 31, 45, 29, 17, 40, 26, and 23. This was out of 45 total points. In R, do the following:

  1. Create a vector of the scores and assign it to a variable of suitable nomenclature.
  2. Use the functions min(), sum(), max(), and mean() to derive these mathematical properties.
  3. Standardize the scores as a percentage.
  4. How many of each letter grade were achieved?

Matrices

2-Dimensional Vectors

For some mathematical operations, we need to work with matrices. These are another ‘general’ container but with dimensions for rows and columns of data.

matrix( 1:9, ncol=3 )
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
matrix( LETTERS[1:9], nrow=3)
     [,1] [,2] [,3]
[1,] "A"  "D"  "G" 
[2,] "B"  "E"  "H" 
[3,] "C"  "F"  "I" 

2-Dimensional Vectors

Matrices are filled column-wise by default. If you want them filled row-wise, you have to ask for it.

matrix( 1:9, ncol=3 )
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
matrix( 1:9, ncol=3, byrow = TRUE )
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

Indices

Just like vectors, the square brackets are used to access values within a matrix. However, there are now two indices, one for the row and one for the column.

 

X <- matrix( 1:9, ncol=3, byrow = TRUE )
X
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
X[1,3] <- 42
X
     [,1] [,2] [,3]
[1,]    1    2   42
[2,]    4    5    6
[3,]    7    8    9

Slicing

You can get an entire row or column using what is called a slice index.

X
     [,1] [,2] [,3]
[1,]    1    2   42
[2,]    4    5    6
[3,]    7    8    9

 

X[,2]
[1] 2 5 8
X[1,]
[1]  1  2 42
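
Notice that a slice comes back as a plain vector. If you need it to stay a matrix, the drop argument of the square brackets can be set to FALSE. A minimal sketch, rebuilding the same matrix under the made-up name M:

```r
# 'M' is a hypothetical name; same values as X above
M <- matrix( 1:9, ncol=3, byrow = TRUE )
M[1,3] <- 42

dim( M[ , 2] )                 # NULL, because the slice is a plain vector
dim( M[ , 2, drop = FALSE] )   # still a matrix, with one column
M[ 1:2, c(1,3) ]               # several rows and columns at once
```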

Matrix Operators

Arithmetic operators on matrices work the same way (as long as the matrices have the proper number of rows and columns).

 

X <- matrix( 1:4, ncol=2)
X
     [,1] [,2]
[1,]    1    3
[2,]    2    4
Y <- matrix( c(3,5,7,9), ncol=2 )
Y
     [,1] [,2]
[1,]    3    7
[2,]    5    9

Binary Operators

X + Y
     [,1] [,2]
[1,]    4   10
[2,]    7   13
X * Y
     [,1] [,2]
[1,]    3   21
[2,]   10   36

This is element-wise multiplication (aka the Hadamard product).

Matrix Multiplication

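Matrix multiplication is slightly more involved: each entry of the result is the sum of the products of a row of the first matrix and a column of the second. It has its own operator, %*%.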

 

X %*% Y 
     [,1] [,2]
[1,]   18   34
[2,]   26   50
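
To see where a single entry of the product comes from, and how the different kinds of multiplication differ, here is a small sketch using the same X and Y. The kronecker() function is part of base R and is included only to show that the Kronecker product is a different (and larger) object.

```r
X <- matrix( 1:4, ncol=2)
Y <- matrix( c(3,5,7,9), ncol=2 )

# Row 1 of X times column 1 of Y gives the top-left entry of X %*% Y
sum( X[1, ] * Y[ , 1] )   # 1*3 + 3*5

# Element-wise multiplication does not care about order, matrix multiplication does
identical( X * Y, Y * X )
identical( X %*% Y, Y %*% X )

# The Kronecker product of two 2x2 matrices has 4 rows and 4 columns
dim( kronecker( X, Y ) )
```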

In-Class Activity

For most of you, this will be the only time you’ll be working with matrices (so soak in the glory of the moment; it is all non-matrix R reality from here on out!). Using a sequence of numbers from 20 to 42:

  • Create a matrix with 4 rows.
  • Create a matrix with 5 rows.
  • What happens when you try to create one with 6 rows?
  • Show what the optional argument byrow does to a 4x5 matrix (e.g., try both byrow=TRUE and byrow=FALSE).

Lists

Lists

Lists are more versatile containers in that they allow you to store different kinds of data in them.

By default, they are numerically indexed.

lst <- list( "Bob", 32, TRUE )
lst
[[1]]
[1] "Bob"

[[2]]
[1] 32

[[3]]
[1] TRUE

Double Square Brackets

Notice that lists use two sets of square brackets instead of one, to differentiate them from a normal vector.

lst[[1]] 
[1] "Bob"
lst
[[1]]
[1] "Bob"

[[2]]
[1] 32

[[3]]
[1] TRUE

Why The Double Brackets?

This is because, technically, single brackets return the first element wrapped in a list of its own, and what we are trying to get is the element inside that contained list.

c( class(lst), class(lst[1]), class(lst[[1]]) )
[1] "list"      "list"      "character"
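
The practical consequence is that single brackets give you a smaller list, while double brackets give you a value you can work with directly. A short sketch using the same lst:

```r
lst <- list( "Bob", 32, TRUE )

lst[1:2]            # single brackets: a shorter list with two elements
length( lst[1:2] )

lst[[2]] + 1        # double brackets: the value itself, ready for arithmetic
# lst[2] + 1        # would fail, because lst[2] is still a list
```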

Named Lists

Lists can be made more friendly to you by using actual names for the keys associated with each value. In some languages, like Python, these are referred to as dictionaries.

info <- list("Name" = "Bob", "Age" = 42)
info
$Name
[1] "Bob"

$Age
[1] 42

Notice the use of the $ in the output.

Named Lists

This $ notation is used to easily grab the contents of the list at that slot.

 

info$Name <- "Robert"
info
$Name
[1] "Robert"

$Age
[1] 42

Named Lists

As well as to add new entries to the list directly.

 

info$PassedDyersClass <- TRUE
info
$Name
[1] "Robert"

$Age
[1] 42

$PassedDyersClass
[1] TRUE

Square Brackets Also Work

You can also use the double brackets AND the name of the key as a reference.

info[["Name"]]
[1] "Robert"

However this is even more work and looks a bit less elegant than the $ notation. Also, if you look at the order of operations, you’ll see that the $ notation has a higher precedence in operations than the single or double brackets (see ?Syntax).
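
There is one situation where the double brackets are the better tool: when the name of the key is itself stored in a variable. A small sketch (the variable key is made up for this example, and info is rebuilt with its original values):

```r
info <- list("Name" = "Bob", "Age" = 42)

# 'key' is a hypothetical variable holding the name of the slot we want
key <- "Age"
info[[key]]     # looks up the slot named by the contents of key
info$key        # NULL, because $ looks for a slot literally called "key"
```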

Lists are Ubiquitous

In R, you will most likely work with list objects as analysis results rather than as a container to keep your data. Almost all analyses return their values as a list with the included components. Here is an example.

head( iris )
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Iris Data Raw

summary( iris )
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
                
                

Iris Data Visualized

library( ggplot2 )
ggplot( iris, aes(Sepal.Length, Petal.Length, color=Species) ) + 
  geom_point() + theme_minimal()

Correlation

Here is a quick correlation between the sepal and petal lengths in the iris data set.

iris.test <- cor.test( iris$Sepal.Length, iris$Petal.Length )
iris.test

    Pearson's product-moment correlation

data:  iris$Sepal.Length and iris$Petal.Length
t = 21.646, df = 148, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8270363 0.9055080
sample estimates:
      cor 
0.8717538 

Packaging of Results

is.list( iris.test )
[1] TRUE
class( iris.test )
[1] "htest"

What is hidden inside?

                                                        Values
statistic.t                                   21.6460193457598
parameter.df                                               148
p.value                                   1.03866741944978e-47
estimate.cor                                 0.871753775886583
null.value.correlation                                       0
alternative                                          two.sided
method                    Pearson's product-moment correlation
data.name              iris$Sepal.Length and iris$Petal.Length
conf.int1                                    0.827036329664362
conf.int2                                    0.905508048821454

Custom Printing

Printing the result shows the components of the analysis in a way that makes sense because, while it is a list, it also has the class htest, which comes with its own print method.

iris.test

    Pearson's product-moment correlation

data:  iris$Sepal.Length and iris$Petal.Length
t = 21.646, df = 148, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8270363 0.9055080
sample estimates:
      cor 
0.8717538 
class( iris.test )
[1] "htest"

Benefits

This is awesome because it makes it much easier to use something like the values stored in iris.test to insert the data from our analyses directly (inline) into our text.

There was a significant relationship between sepal and petal length (Pearson’s product-moment correlation, \(\rho =\) 0.872, \(t =\) 21.6, P = 1.04e-47).
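
As a rough sketch of how those inline numbers could be pulled out of the list, the following uses the element names shown in the table of hidden values. The variable names r, t.stat, and p are made up here, and sprintf() is only one of several ways to format the text.

```r
iris.test <- cor.test( iris$Sepal.Length, iris$Petal.Length )

# Pull the pieces out of the list and round them for reporting
r      <- round( unname( iris.test$estimate ), 3 )
t.stat <- round( unname( iris.test$statistic ), 1 )
p      <- signif( iris.test$p.value, 3 )

# Assemble them into a piece of text
sprintf( "r = %.3f, t = %.1f, P = %.3g", r, t.stat, p )
```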

In-Class Activity

Do a quick analysis of width for sepals and petals in the three iris species.

iris.test <- cor.test( iris$Sepal.Width, iris$Petal.Width )
  • What is the name of the list element that has the correlation statistic?
  • What are the names of the list elements that have the test statistic and probability?
  • Do these data suggest there is a significant relationship? How do you know?

Data Frames

The Lingua Franca of Data Analysis

The main container that almost all of your data will be contained in is the data.frame.

  • Similar to a spreadsheet
  • Each column has the same data type (e.g., weight, longitude, survived)
  • Each row has all the observations for a given entity.

Example

Let’s consider the following data as individual vectors.

names <- c("Bob","Alice","Jane", "Norm")
homework.1 <- c(0.78, 0.95, 0.82, NA)
homework.2 <- c(NA, 0.89, 0.92, 0.79 )

Example

These can be put into a data.frame as:

gradebook <- data.frame( names, homework.1, homework.2 )
gradebook
  names homework.1 homework.2
1   Bob       0.78         NA
2 Alice       0.95       0.89
3  Jane       0.82       0.92
4  Norm         NA       0.79

Example

Each column in a data.frame is a self-contained set of data all of the same type and as such can be summarized.

summary( gradebook )
    names             homework.1      homework.2    
 Length:4           Min.   :0.780   Min.   :0.7900  
 Class :character   1st Qu.:0.800   1st Qu.:0.8400  
 Mode  :character   Median :0.820   Median :0.8900  
                    Mean   :0.850   Mean   :0.8667  
                    3rd Qu.:0.885   3rd Qu.:0.9050  
                    Max.   :0.950   Max.   :0.9200  
                    NA's   :1       NA's   :1       

Named Columns

Just like in a list, the columns of a data.frame are accessed by their names, and we can use the $ notation.

names(gradebook)
[1] "names"      "homework.1" "homework.2"
gradebook$homework.1
[1] 0.78 0.95 0.82   NA
is.na( gradebook$homework.2 )
[1]  TRUE FALSE FALSE FALSE
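
Missing values matter as soon as you summarize a column. A minimal sketch using the homework.1 scores from above: most summary functions return NA unless you ask them to remove the missing values first.

```r
homework.1 <- c(0.78, 0.95, 0.82, NA)

mean( homework.1 )                  # NA, since one value is unknown
mean( homework.1, na.rm = TRUE )    # drop missing values first
sum( is.na( homework.1 ) )          # how many values are missing
```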

Indexing of Elements

The easiest way to index values in a data.frame is to use the $ notation to grab the column (as a vector object) and then to use the square brackets to access a specific element.

gradebook$homework.2[1]
[1] NA
gradebook$homework.2[1] <- 0.85
gradebook
  names homework.1 homework.2
1   Bob       0.78       0.85
2 Alice       0.95       0.89
3  Jane       0.82       0.92
4  Norm         NA       0.79

Indexing of Elements

You can also use the numerical indices for both row and column in the data.frame (n.b., it is row first then column).

gradebook
  names homework.1 homework.2
1   Bob       0.78       0.85
2 Alice       0.95       0.89
3  Jane       0.82       0.92
4  Norm         NA       0.79

 

gradebook[2,1]
[1] "Alice"
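
The row and column positions do not have to be numbers. As a sketch (rebuilding the same gradebook so the example runs on its own), a column can be given by name and rows can be picked with a logical test:

```r
gradebook <- data.frame(
  names      = c("Bob","Alice","Jane", "Norm"),
  homework.1 = c(0.78, 0.95, 0.82, NA),
  homework.2 = c(0.85, 0.89, 0.92, 0.79)
)

gradebook[ 2, "homework.1" ]               # row by number, column by name
gradebook[ gradebook$names == "Jane", ]    # all columns for one student
```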

Dimensions

The dimensions of a data.frame (number of rows, then number of columns) are often relevant.

dim( gradebook )
[1] 4 3

or individually

nrow( gradebook )
[1] 4
ncol( gradebook )
[1] 3

External Data

You will almost never create data.frame objects de novo but instead load data in from some external resource. There are several functions that simplify this within tidyverse so let’s make sure we have it loaded into memory.

 

library( tidyverse )

Example Data

Here is a CSV file that is contained in this repository. Since it is a public repository, we can access it from within GitHub using a URL.

url <- "https://raw.githubusercontent.com/DyerlabTeaching/Data-Containers/main/data/arapat.csv"
beetles <- read_csv( url )

Example Data

Araptus attenuatus

Data

dim( beetles )
[1] 39  3
head( beetles )
# A tibble: 6 × 3
  Stratum Longitude Latitude
  <chr>       <dbl>    <dbl>
1 88          -114.     29.3
2 9           -114.     29.0
3 84          -114.     29.0
4 175         -113.     28.7
5 177         -114.     28.7
6 173         -113.     28.4

A Tangent on Pipes

Remember René Magritte’s Pipe? I used this for a reason:

  • Data manipulation is a multi-step process
  • Intermediate steps can be messy
  • Leads to “spaghetti code”

We use the term “pipe” in the sense of making a connection of data flows from one step to the next.

\[ Load\;Data \to Format\;Dates \to Scale\;Values \to Make\;Plot \]

The Pipe Operator(s)

Originally, there was a library named magrittr that defined one of those compound operators, %>%. Instead of making the function call in the usual form, function( data ), like this:

summary( beetles )

The Pipe Operator(s)

We could take the beetles object and pipe it (i.e., pass its values) into the summary function (as the first argument that summary receives).

 

beetles  %>%  summary

The Pipe Operator(s)

In fact, we could get rather expressive with this kind of piping and use built-in indentation rules to make the code significantly more readable.

Compare the following code that summarizes the first 10 entries in the beetles data set.

h <- head( beetles, n = 10)
summary( h )
beetles %>%
  head(n=10) %>%
  summary 

Three Characters are Just Too Much!

In fact, this became so popular, that the R language gurus decided to make a pipe operator that does not need the magrittr library at all (and is only 2 characters in length). You will see both of these operators in action.

The magrittr version.

beetles %>%
  head(n=10) %>%
  summary 

The built-in version.

beetles |>
  head(n=10) |>
  summary()    # the built-in pipe requires the parentheses
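
A quick way to convince yourself that piping is only a different way of writing the same nested call is to compare the two results. This sketch uses the built-in mtcars data so that nothing has to be downloaded, and it assumes R 4.1 or later for the built-in pipe. Note that the built-in pipe needs a function call with parentheses on its right-hand side, so it is written summary() rather than summary.

```r
# mtcars is a built-in data set, used here so no download is needed
nested <- summary( head( mtcars, n = 10 ) )

piped <- mtcars |>
  head( n = 10 ) |>
  summary()

identical( nested, piped )
```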

Example Data

beetles %>%
  leaflet::leaflet() |>
  leaflet::addProviderTiles(provider = leaflet::providers$Esri.WorldTopo) %>%
  leaflet::addMarkers( ~Longitude, ~Latitude,popup = ~Stratum )

In-Class

From the beetle data, how would you estimate the centroid coordinate of the data set?

Advanced Data Frames

Default Data Set

There are many built-in data sets that we can play with. Let’s copy one of these and then practice adding to and deleting from it.

data <- mtcars
names(data)
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
[11] "carb"

 

head( data )
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

 

summary( data )
      mpg             cyl             disp             hp       
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
      drat             wt             qsec             vs        
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
       am              gear            carb      
 Min.   :0.0000   Min.   :3.000   Min.   :1.000  
 1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
 Median :0.0000   Median :4.000   Median :2.000  
 Mean   :0.4062   Mean   :3.688   Mean   :2.812  
 3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :1.0000   Max.   :5.000   Max.   :8.000  

Data Types

Some of the data are continuous

data$drat
 [1] 3.90 3.90 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 3.92 3.07 3.07 3.07 2.93
[16] 3.00 3.23 4.08 4.93 4.22 3.70 2.76 3.15 3.73 3.08 4.08 4.43 3.77 4.22 3.62
[31] 3.54 4.11

 

And some are discrete (we will come back to this distinction in a week or so):

unique( data$cyl )
[1] 6 4 8

Let’s Use A smaller set of data

data |>
  tail( n=4 ) -> data
data
                mpg cyl disp  hp drat   wt qsec vs am gear carb
Ford Pantera L 15.8   8  351 264 4.22 3.17 14.5  0  1    5    4
Ferrari Dino   19.7   6  145 175 3.62 2.77 15.5  0  1    5    6
Maserati Bora  15.0   8  301 335 3.54 3.57 14.6  0  1    5    8
Volvo 142E     21.4   4  121 109 4.11 2.78 18.6  1  1    4    2

Deleting A Column

We can delete columns from a data.frame object by assigning it NULL.

data$vs <- NULL
data 
                mpg cyl disp  hp drat   wt qsec am gear carb
Ford Pantera L 15.8   8  351 264 4.22 3.17 14.5  1    5    4
Ferrari Dino   19.7   6  145 175 3.62 2.77 15.5  1    5    6
Maserati Bora  15.0   8  301 335 3.54 3.57 14.6  1    5    8
Volvo 142E     21.4   4  121 109 4.11 2.78 18.6  1    4    2

Chaining Deletes

data$am <- data$drat <- data$wt <- NULL
data
                mpg cyl disp  hp qsec gear carb
Ford Pantera L 15.8   8  351 264 14.5    5    4
Ferrari Dino   19.7   6  145 175 15.5    5    6
Maserati Bora  15.0   8  301 335 14.6    5    8
Volvo 142E     21.4   4  121 109 18.6    4    2

Deleting by Names

data["gear"] <- data["carb"] <- data["cyl"] <- NULL
data
                mpg disp  hp qsec
Ford Pantera L 15.8  351 264 14.5
Ferrari Dino   19.7  145 175 15.5
Maserati Bora  15.0  301 335 14.6
Volvo 142E     21.4  121 109 18.6
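
An alternative to deleting columns one at a time is to keep only the columns whose names are not on a list of unwanted ones. This is a sketch of that idea in base R. The names small and drop are made up here, and the data are a fresh copy of the last four rows of mtcars.

```r
small <- tail( mtcars, n = 4 )
drop  <- c("gear", "carb", "cyl")      # columns we no longer want

# Keep every column whose name is NOT in the drop list
small <- small[ , !( names(small) %in% drop ) ]
names( small )
```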

Adding a Column

data$Dreamy <- c(TRUE, TRUE, TRUE, FALSE)
data
                mpg disp  hp qsec Dreamy
Ford Pantera L 15.8  351 264 14.5   TRUE
Ferrari Dino   19.7  145 175 15.5   TRUE
Maserati Bora  15.0  301 335 14.6   TRUE
Volvo 142E     21.4  121 109 18.6  FALSE

Adding a Row

To add a row to a data.frame, we really need to make an identical data.frame and then bind it onto the bottom of it.

dyerVW <- data.frame( mpg = 21.4, disp=91, hp = 53, qsec = 20.9, Dreamy=TRUE)
dyerVW
   mpg disp hp qsec Dreamy
1 21.4   91 53 20.9   TRUE

Row Names

The mtcars data set has the name of the car in the data but not as a column of data itself… This is an older way of doing it and one that is not commonly used any more.

To give dyerVW the name of the vehicle as its row name, like the mtcars data set, we use the function rownames().

rownames( dyerVW ) <- "Volkswagen Beetle"
dyerVW
                   mpg disp hp qsec Dreamy
Volkswagen Beetle 21.4   91 53 20.9   TRUE

Binding Requirements

OK, so we are ready to bind them (let’s verify they have the same columns).

dyerVW
                   mpg disp hp qsec Dreamy
Volkswagen Beetle 21.4   91 53 20.9   TRUE

onto the original data set

data 
                mpg disp  hp qsec Dreamy
Ford Pantera L 15.8  351 264 14.5   TRUE
Ferrari Dino   19.7  145 175 15.5   TRUE
Maserati Bora  15.0  301 335 14.6   TRUE
Volvo 142E     21.4  121 109 18.6  FALSE

Row Binding

rbind( data, dyerVW)
                   mpg disp  hp qsec Dreamy
Ford Pantera L    15.8  351 264 14.5   TRUE
Ferrari Dino      19.7  145 175 15.5   TRUE
Maserati Bora     15.0  301 335 14.6   TRUE
Volvo 142E        21.4  121 109 18.6  FALSE
Volkswagen Beetle 21.4   91  53 20.9   TRUE
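
For data frames, rbind() lines the columns up by name rather than by position, which is why having the same column names matters. A tiny sketch with two made-up one-row data frames whose columns are in a different order:

```r
# 'a' and 'b' are hypothetical data frames used only for this sketch
a <- data.frame( mpg = 15.8, hp = 264 )
b <- data.frame( hp = 53, mpg = 21.4 )   # same columns, different order

both <- rbind( a, b )
both
```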

Did it Take?

data
                mpg disp  hp qsec Dreamy
Ford Pantera L 15.8  351 264 14.5   TRUE
Ferrari Dino   19.7  145 175 15.5   TRUE
Maserati Bora  15.0  301 335 14.6   TRUE
Volvo 142E     21.4  121 109 18.6  FALSE

Why?

Making It Stick

rbind( data, dyerVW) -> data 
data
                   mpg disp  hp qsec Dreamy
Ford Pantera L    15.8  351 264 14.5   TRUE
Ferrari Dino      19.7  145 175 15.5   TRUE
Maserati Bora     15.0  301 335 14.6   TRUE
Volvo 142E        21.4  121 109 18.6  FALSE
Volkswagen Beetle 21.4   91  53 20.9   TRUE

Informative Names

There are many times that we can use real names for columns of data. This is beneficial to us because when we plot it or make a table, if we use an abbreviated name like hp or qsec, we’ll have to fix the labels or do some other workaround.

Dyer’s Rule #1: Use informative names for your data.

Use Your Big Words

names( data )
[1] "mpg"    "disp"   "hp"     "qsec"   "Dreamy"
names(data)[1] <- "MPG"
names(data)[2] <- "Displacement"
data
                   MPG Displacement  hp qsec Dreamy
Ford Pantera L    15.8          351 264 14.5   TRUE
Ferrari Dino      19.7          145 175 15.5   TRUE
Maserati Bora     15.0          301 335 14.6   TRUE
Volvo 142E        21.4          121 109 18.6  FALSE
Volkswagen Beetle 21.4           91  53 20.9   TRUE

Compound Names

Since we cannot have spaces in variable names (and the columns of a data.frame are just variables), we need to enclose a compound name in quotes so R recognizes it as a single entity instead of two or more variable names.

names(data)[3] <- "Horse Power"
names(data)[4:5] <- c("Quarter Mile", "Dream Car")
data 
                   MPG Displacement Horse Power Quarter Mile Dream Car
Ford Pantera L    15.8          351         264         14.5      TRUE
Ferrari Dino      19.7          145         175         15.5      TRUE
Maserati Bora     15.0          301         335         14.6      TRUE
Volvo 142E        21.4          121         109         18.6     FALSE
Volkswagen Beetle 21.4           91          53         20.9      TRUE

Accessing Compound Names

data[["Horse Power"]]
[1] 264 175 335 109  53

With $-notation we need to take special care (because of the space(s) )

data$`Horse Power`
[1] 264 175 335 109  53
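
One thing to watch for: if you build a data.frame with a compound name directly, data.frame() will by default replace the space with a period. Setting check.names = FALSE keeps the name as typed. A small sketch with made-up data frames df1 and df2:

```r
# Default behaviour: the space is replaced
df1 <- data.frame( `Horse Power` = c(264, 175) )
names( df1 )

# Asking R to leave the names alone
df2 <- data.frame( `Horse Power` = c(264, 175), check.names = FALSE )
names( df2 )
```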

Sorting Data Frames

Occasionally, you’ll need to sort a data.frame to get some inference out of it (e.g., slowest Quarter Mile, best MPG, etc.). We can use the arrange() function (actually from dplyr, but we will be diving into it next week) to easily do this.

arrange( data, MPG )
                   MPG Displacement Horse Power Quarter Mile Dream Car
Maserati Bora     15.0          301         335         14.6      TRUE
Ford Pantera L    15.8          351         264         14.5      TRUE
Ferrari Dino      19.7          145         175         15.5      TRUE
Volvo 142E        21.4          121         109         18.6     FALSE
Volkswagen Beetle 21.4           91          53         20.9      TRUE

Sorting in Reverse

To sort in reverse, we put a minus sign in front of the column name to indicate sorting in decreasing order.

arrange( data, -Displacement )
                   MPG Displacement Horse Power Quarter Mile Dream Car
Ford Pantera L    15.8          351         264         14.5      TRUE
Maserati Bora     15.0          301         335         14.6      TRUE
Ferrari Dino      19.7          145         175         15.5      TRUE
Volvo 142E        21.4          121         109         18.6     FALSE
Volkswagen Beetle 21.4           91          53         20.9      TRUE
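
If dplyr is not loaded, a similar result can be had in base R with order(), which returns the row positions in sorted order. This is only a sketch of that alternative, using a fresh copy of the last four rows of mtcars with its original column names (the name small is made up here):

```r
small <- tail( mtcars, n = 4 )

# order() gives the row positions from largest to smallest displacement
sorted <- small[ order( small$disp, decreasing = TRUE ), c("mpg", "disp") ]
sorted
```

dplyr also provides desc(), which can be wrapped around a column inside arrange() and works for non-numeric columns as well.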

Compound Sorting

We can sort the whole data.frame using multiple columns by adding them to the call as additional arguments (n.b., a logical sorts by numerical value with FALSE == 0 and TRUE == 1. Suck it Volvo!!!).

arrange( data, -MPG, -`Dream Car`)
                   MPG Displacement Horse Power Quarter Mile Dream Car
Volkswagen Beetle 21.4           91          53         20.9      TRUE
Volvo 142E        21.4          121         109         18.6     FALSE
Ferrari Dino      19.7          145         175         15.5      TRUE
Ford Pantera L    15.8          351         264         14.5      TRUE
Maserati Bora     15.0          301         335         14.6      TRUE

Questions

If you have any questions, please feel free to post to the Canvas discussion board for the class, or drop me an email.
